Abstract

In Phys. Rev. Letters, 73:2, 5 Dec. 94, Mantegna et al. conclude on the basis of Zipf rank frequency
data that noncoding DNA sequence regions are more like natural languages than coding regions. We argue
on the contrary that an empirical fit to Zipf's "law" cannot be used as a criterion for similarity to natural
languages. Although DNA is presumably an "organized system of signs" in Mandelbrot's (1961) sense,
an observation of statistical features of the sort presented in the Mantegna et al. paper does not shed light
on the similarity between DNA's "grammar" and natural language grammars, just as the observation of
exact Zipf-like behavior cannot distinguish between the underlying processes of tossing an M-sided die and
a finite-state branching process.

In Phys. Review Letters, 73:2, 5 Dec. 94, Mantegna
et al. "extend the Zipf approach to analyzing linguistic
texts to the statistical study of DNA base pair sequences
and find that the noncoding regions are more similar to
natural languages than the coding sequences" (p. 3169).
Specifically, the authors analyze coding/noncoding DNA
sequences and conclude that noncoding regions show a
more Zipf-like behavior than coding regions. Asserting
that "A remarkable feature of languages is Zipf's law"
(p. 3169), they further conclude that noncoding regions
are more similar to natural languages than coding regions
(p. 3170):

The averages for each category support the
observation that $\zeta$ is consistently larger for
the noncoding sequences, suggesting that the
noncoding sequences bear more resemblance
to a natural language than the coding
sequences.

Their result has received popular notice in both Science 
(266, p. 1320, 25 Nov. 1994) and Scientific American 
(272(3), March, 1995). 

In this note we would like to argue that the Man- 
tegna et al. conclusion is rather farfetched. Noncoding 
DNA sequences do not show much similarity to natural 
languages. Rather, as far as one can judge from the ev- 
idence of the Mantegna et al. paper, all one can say — if 
their statistical analysis is not in question, which it may 
well be — is that noncoding DNA sequences and natural 
languages combine discrete symbols to form strings that 
obey Zipf's law. But this is of course what we knew from 
the outset. In particular: 

• Any number of random processes outputting dis- 
crete symbols can display Zipf-like behavior with- 
out bearing any resemblance to the special genera- 
tive processes currently believed to govern sentence 
formation (word sequences) in natural languages. 
In this sense Zipf's law is not peculiar to natural 
languages at all, and therefore cannot be used as a 
strong test for whether DNA, or anything else for 
that matter, has something "in common with nat- 
ural languages." Indeed, exactly this same point 
was made at length over 30 years ago by Mandel- 
brot (1961) in his familiar discussion of Zipf's law: 

Further, because statistical and gram- 
matical structures seem uncorrelated, in 
the first approximation, one might ex- 
pect to encounter laws which are inde- 
pendent of the grammar of the language 
under consideration. Hence, from the 
viewpoint of significance (and also of the 
mathematical method) there would be 
an enormous difference between: on the 
one hand, the collection of data that are 
unlikely to exhibit any regularity other 
than the approximate stability of the 
relative frequencies, when different sam- 
ples are compared [i.e., data leading to 
statistical laws like Zipf's law; our com- 
ments pn/rcb]; and, on the other hand, 
the study of laws that are valid for natural
discourse [the discovery of such laws
being the goal of linguistics pn/rcb] but 
not for other organized systems of signs, 
(p. 213) 

As is also familiar and as we show by examples below,
it is quite easy to generate Zipf-like distributions
from very simple generative processes that are quite unlike
natural languages, e.g., tossing an M-sided die and
particular very simple finite-state branching processes. 1
In short, although DNA is presumably an "organized
system of signs" in Mandelbrot's sense, an observation
of statistical features of the sort presented in the Mantegna
et al. paper does not shed light on the similarity
between DNA's "grammar" and natural language grammars,
just as the observation of exact Zipf-like behavior
cannot distinguish between the underlying processes of
tossing an M-sided die and a finite-state branching process.
An empirical fit to Zipf's law cannot be used as a
criterion for similarity to natural languages.

• Zipf's law is given by $f r = C$, where $f$ is the frequency
of any word, and $r$ is its rank, with words
arranged from most frequent to least frequent. In
other words, $\ln(f) = K - \zeta \ln(r)$ (with $\zeta = 1$).
The authors find that $\zeta$ is 0.286 for coding regions,
0.386 for noncoding regions, and 0.57 for natural
languages. Without further statistical tests, it
is not unreasonable to conclude that coding
and noncoding DNA sequences are more like
each other than either is like natural languages, and
that Zipf's law is violated. What is plainly required
are the usual significance tests addressing precisely
this question, e.g., testing the null hypothesis that coding
$\zeta$ is the same as natural language $\zeta$. Since the variances
are clearly available, it should be possible for the
authors or others to carry out the required tests on
the original data.

• As a minor point, in fact the two measures used in 
the paper — Zipf behavior, and Shannon entropy — 
are exactly correlated. Therefore it is not surpris- 
ing that given Zipf-like behavior for noncoding se- 
quences, one would also observe that noncoding re- 
gions have lower entropy than coding regions. In 
effect, there is just one, not "two similar statistical 
properties" (p. 3172) that natural languages and 
noncoding sequences share (if they share it at all), 
namely, Zipf-like behavior (or lower entropy). 

For a finite number of "words," entropy is largest
for a uniform distribution over word frequencies.
The more skewed the word frequencies, the lower
the entropy. For coding regions (with $\zeta = 0.286$),
the word frequencies fall off more slowly with rank
than for noncoding regions ($\zeta = 0.386$). Consequently,
coding regions will have higher entropy
and lower redundancy than noncoding regions.
Having carried out a Zipf analysis and obtained
$\zeta$, one does not need to compute a separate
entropy test. Yet the authors do so (as they recognize
implicitly in the caption of figure 3 of their
paper).

1 Indeed, as N. Chomsky points out (p.c.), if we take a collection
of English sentences and define "words" by taking the
strings starting with, say, "e" and ending with "e" then the
resulting, more random collection of "words" shows a better
fit to Zipf's "law," precisely because there are no interfering
effects from the more organized features of natural language
words. On this view, the closer fit of noncoding sequences to
a Zipf distribution actually means that noncoding DNA sequences
are more random and more unlike natural languages
than coding sequences, exactly the opposite of the conclusion that
Mantegna et al. maintain.
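
The link between the Zipf exponent and entropy described in the third item above can be checked numerically. The following is a minimal sketch, not the computation used by Mantegna et al.: it assumes an idealized rank-frequency law in which the probability of the word of rank r is proportional to r^(-zeta), and the vocabulary size M = 4096 and the function name zipf_entropy are illustrative assumptions.

```python
import math

def zipf_entropy(zeta, M):
    """Shannon entropy in bits of the distribution p_r proportional to r**(-zeta), r = 1..M."""
    weights = [r ** (-zeta) for r in range(1, M + 1)]
    total = sum(weights)
    probs = [w / total for w in weights]
    return -sum(p * math.log2(p) for p in probs)

M = 4096  # assumed vocabulary size, chosen only for illustration
for zeta in (0.0, 0.286, 0.386, 0.57):
    print(f"zeta = {zeta:5.3f}  entropy = {zipf_entropy(zeta, M):.3f} bits")
```

Under these assumptions the entropy equals log2(M) at zeta = 0 and falls as zeta grows, so ordering sequences by zeta and ordering them by entropy convey the same information.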

Putting aside these and other possibly grave statis- 
tical fallacies, in the remainder of this note we exhibit 
two random processes, one an M-sided die, the other a 
finite-state grammar, that are very different from each 
other yet yield exact Zipf distributions. We then review 
some of the many properties of natural languages not 
shared by these two processes. Consequently, even if we 
accept the results of the Mantegna et al. paper, the 
inference from Zipf-behavior to a similarity with natural
languages cannot be justified. As mentioned, these
points were discussed more than thirty years ago by
Mandelbrot (1961), and we conclude with some historical 
remarks that underscore his results along with related, 
more recent work that has also examined Zipf-behavior 
in DNA sequences. 

1 Zipf's Law and Random Processes: Some Examples

To begin, let us consider two very different, simple ran- 
dom processes that both generate Zipf distributions: an 
M-sided die and a finite-state grammar. 

Let us first recall Zipf's "law" itself. Suppose there are
M "words" in a system. These words might be generated
in various combinations according to some underlying
process, giving rise to a corpus of sentences, or more
generally, word sequences. Since there are only M
words, each word would occur multiple times in a large
(potentially infinite) corpus. One can then rank these
words, from most frequent to least frequent. Let the
frequency of the ith word be $f_i$. If $f_i$ is proportional to
$1/i$, the generative process is said to obey Zipf's law.
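
One common way to compare a corpus against this definition is to rank the word counts and fit a straight line to ln(frequency) against ln(rank); a fitted exponent of 1 corresponds to Zipf's law. The sketch below assumes ordinary least squares, which is our choice for illustration (this note does not prescribe a fitting procedure), and the function names are hypothetical.

```python
import math
from collections import Counter

def rank_frequency(words):
    """Return the word counts sorted from most frequent to least frequent."""
    return sorted(Counter(words).values(), reverse=True)

def fit_zeta(frequencies):
    """Least-squares slope of ln(f) against ln(rank); returns zeta = -slope.

    Expects at least two strictly positive frequencies, already sorted by rank.
    """
    xs = [math.log(rank) for rank in range(1, len(frequencies) + 1)]
    ys = [math.log(f) for f in frequencies]
    n = len(xs)
    mean_x = sum(xs) / n
    mean_y = sum(ys) / n
    sxy = sum((x - mean_x) * (y - mean_y) for x, y in zip(xs, ys))
    sxx = sum((x - mean_x) ** 2 for x in xs)
    return -sxy / sxx

# Idealized counts f_i = 1200 / i for i = 1..6, Zipfian by construction.
ideal = [1200 // i for i in range(1, 7)]
print(fit_zeta(ideal))
```

A fit of this kind yields only a point estimate; deciding whether two estimates differ requires a variance and a significance test as well.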

Example 1: An M-sided die. 

Let the sequence of words be generated by throwing a
biased M-sided die. In particular, let the die be such
that the probability of the ith side appearing on top is
given by:

$$\mathrm{Prob}[i] = \frac{1/i}{\sum_{j=1}^{M} 1/j}$$

Now consider the following process:

1. Toss the biased die.

2. If the die shows j, output word $w_j$.

3. Repeat 1.

Clearly, this process generates a sequence of words
where the first word is twice as likely as the second, three
times as likely as the third, and so on. The process thus
follows Zipf's "law" exactly.
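
The die process is simple enough to simulate directly. The following is a minimal sketch of the three steps above; the function names, the fixed random seed, and the choice M = 4 in the usage lines are illustrative assumptions.

```python
import random
from collections import Counter
from fractions import Fraction

def zipf_die_probs(M):
    """Exact side probabilities Prob[i] = (1/i) / sum_{j=1..M} (1/j)."""
    norm = sum(Fraction(1, j) for j in range(1, M + 1))
    return [Fraction(1, i) / norm for i in range(1, M + 1)]

def toss_corpus(M, n_tosses, seed=0):
    """Toss the biased die n_tosses times; return how often each side came up."""
    rng = random.Random(seed)  # fixed seed so that runs are repeatable
    weights = [float(p) for p in zipf_die_probs(M)]
    sides = rng.choices(range(1, M + 1), weights=weights, k=n_tosses)
    return Counter(sides)

probs = zipf_die_probs(4)
print(probs)  # exact probabilities for M = 4
counts = toss_corpus(4, 100_000)
print([counts[i] for i in range(1, 5)])  # empirical counts for sides 1..4
```

The exact probabilities are in the ratio 1 : 1/2 : 1/3 : 1/4, and the simulated counts approach that ratio as the number of tosses grows.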



Example 2: Finite-State Grammars 

Next we consider a random process generating "sentences"
in a completely different fashion from example 1,
but still obeying Zipf's law. Rather than deal with the
case of M words directly, we provide some intuition in
the form of an example where M = 4. Suppose there
are four words: $w_1$, $w_2$, $w_3$, and $w_4$. Sentences (word
sequences) are produced by combining words in some fashion
according to a grammar. Let us assume that the
generative process is as follows:

1. Start at the root node of the annotated tree of fig. 1. 

2. At each node, choose to go down any of the connected
branches (leading to a daughter node) with
equal probability. Output the word $w_i$ if the branch
is associated with the number i. If the branch is associated
with $\epsilon$, output nothing (empty string).

3. On reaching a leaf node, stop. 

The reader will recognize that this is a finite-state 
grammar. Every path starting from the root node gives 
rise to a sentence. There are 4! different paths, corre- 
sponding to 4! different leaves, giving rise to 4! possible 
sentence types. Since the paths are all equally likely, 
each of these sentences occurs with equal likelihood. 

However, due to the way in which the tree is constructed,
many paths yield the same sentence. For example,
the two paths highlighted in the figure yield the
same sentence, $w_2 w_1$. The reader can check that such
a grammar generates eight different sentences with the
associated probabilities in table 1.

If a corpus of sentences is generated with the probabilities
shown in the table, then it can easily be shown
that the word $w_1$ occurs twice as often as $w_2$, three times
as often as $w_3$, and four times as often as $w_4$. In other
words, if we plot word frequencies, they follow
Zipf's law. 2
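
This claim can be checked directly from the sentence probabilities listed in Table 1. The sketch below hard-codes those eight entries, writing each sentence as a tuple of word indices, and sums the probability of every sentence in which each word appears; the names TABLE_1 and expected_word_counts are ours.

```python
from fractions import Fraction

# Sentence probabilities from Table 1; (2, 1) stands for the sentence w2 w1.
TABLE_1 = {
    (1,): Fraction(1, 6),
    (1, 2): Fraction(1, 12),
    (2, 1): Fraction(1, 4),
    (3, 1): Fraction(1, 6),
    (3, 2, 1): Fraction(1, 12),
    (4, 1): Fraction(1, 12),
    (4, 2, 1): Fraction(1, 12),
    (4, 3, 1): Fraction(1, 12),
}

def expected_word_counts(table):
    """Expected number of occurrences of each word per generated sentence."""
    counts = {}
    for sentence, prob in table.items():
        for word in sentence:
            counts[word] = counts.get(word, Fraction(0)) + prob
    return counts

for word, count in sorted(expected_word_counts(TABLE_1).items()):
    print(f"w{word}: {count}")
```

The expected counts come out as 1, 1/2, 1/3, and 1/4 for the four words, which is the exact 1/rank pattern.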

In general, if there are M words, then one could con- 
struct a similar tree. Such a tree would have M! leaves, 
each leaf giving rise to a sentence. The branches could 
be numbered (as done in the case where M = 4) so that 
all the M! different permutations of M words can be 
generated. Now, as in the specific M = 4 case, we replace
some of the numbers by $\epsilon$, equivalent to outputting
an empty string for that branch. Let us now argue that
this replacement can be carried out and yields a grammar
that generates a Zipf distribution.

Figure 1: A tree diagram representation of a finite-state grammar.

Sentence          Prob.
$w_1$             1/6
$w_1 w_2$         1/12
$w_2 w_1$         1/4
$w_3 w_1$         1/6
$w_3 w_2 w_1$     1/12
$w_4 w_1$         1/12
$w_4 w_2 w_1$     1/12
$w_4 w_3 w_1$     1/12

Table 1: Sentences generated by the finite-state grammar of fig. 1, along with the probability with which they are
generated.

2 Note that the probability of occurrence of each word is
inversely proportional to its rank. In a finite corpus, the
frequency of occurrence need not be exactly equal to the
probability. However, the convergence of frequencies to their
underlying expectations makes it more and more likely that
frequency-rank behavior will follow Zipf's law as the number
of sentences in the corpus increases, with convergence in the
limit as the corpus size goes to infinity.

We first make the following observations to describe
what the M-tree looks like before any such replacements
have been made. There are M branches at level 1. Each
of these branches bears a label from 1 to M, and no two
branches bear the same label. There are M(M - 1)
branches at level 2. There are an equal number of
branches bearing each label from 1 to M. Consequently,
M - 1 of the branches at level 2 are labelled i for every i
from 1 to M. Similarly, there are M(M - 1)(M - 2) branches
at level 3, with (M - 1)(M - 2) being labelled i for every
i from 1 to M. As mentioned before, there are M!
different leaves, each giving rise to a different sentence
(assuming no labels are replaced by $\epsilon$). Each sentence
is M words long, a permutation of the M words with no
repeated word.

Next, consider how we replace the labels by empty
strings $\epsilon$. Consider all the branches labelled j. Each
time such a branch is traversed, the grammar outputs
the word $w_j$. Suppose we choose to replace some of the
j labels by $\epsilon$, leaving only $a_1$ branches at level 1 still
labelled, $a_2$ branches at level 2, and so on. We can then
prove the following two theorems (given here without
proof):

Theorem 1 Suppose $a_1$ branches at level 1 are still labelled
and the remaining branches are labelled $\epsilon$. Similarly,
suppose $a_2$ are labelled at level 2, $a_3$ labelled at
level 3, and so on. Then a fraction $f$ of the total number
of paths through the tree yields a sentence containing
the word $w_j$, where $f$ is given by:

$$f = \frac{a_1}{M} + \frac{a_2}{M(M-1)} + \frac{a_3}{M(M-1)(M-2)} + \cdots + \frac{a_M}{M!}$$

Clearly, $0 \le a_1 \le 1$; $0 \le a_2 \le (M-1)$, and in general,
$0 \le a_i \le \frac{(M-1)!}{(M-i)!}$. Given these constraints on the $a_i$'s, we
can also prove the following:

Theorem 2 Any fraction that can be represented as $\frac{i}{M!}$,
where $i$ is an integer between 0 and $M!$, can be obtained by
an appropriate setting for the $a_i$'s under the constraints
of Theorem 1.

A consequence of these theorems taken together is
that one can generate sentences in such a way that in a
corpus the word $w_j$ can be made to occur in only a fraction
$f = \frac{k}{M!}$ of the sentences. In particular, by choosing
$k$ appropriately, we can make the jth word, $w_j$, occur
with frequency 1/j in the text, thus following Zipf's law
exactly.
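
The arithmetic behind this consequence can be illustrated for a single word at a time. The sketch below rests on the assumption, taken from the counting argument in the text, that a branch at level l carries (M - l)! of the M! paths and that (M - 1)!/(M - l)! branches at that level bear a given label. The greedy choice of the label counts is our own illustrative strategy, not the authors' proof (the theorems are stated here without proof), and the function names are hypothetical.

```python
from fractions import Fraction
from math import factorial

def label_counts(M, k):
    """Choose counts a_1..a_M, with 0 <= a_l <= (M-1)!/(M-l)!, whose path fractions sum to k/M!.

    Greedy from level 1 downward; an illustrative strategy only.
    """
    if not 0 <= k <= factorial(M):
        raise ValueError("k must lie between 0 and M!")
    remaining = k
    counts = []
    for level in range(1, M + 1):
        weight = factorial(M - level)                   # paths through one branch at this level
        cap = factorial(M - 1) // factorial(M - level)  # branches bearing the label at this level
        a = min(cap, remaining // weight)
        counts.append(a)
        remaining -= a * weight
    assert remaining == 0
    return counts

def path_fraction(M, counts):
    """Evaluate f = a_1/M + a_2/(M(M-1)) + ... + a_M/M! exactly."""
    return sum(Fraction(a * factorial(M - level), factorial(M))
               for level, a in enumerate(counts, start=1))

M = 4
for j in range(1, M + 1):
    counts = label_counts(M, factorial(M) // j)  # k = M!/j gives the target fraction 1/j
    print(j, counts, path_fraction(M, counts))
```

The sketch checks only that a suitable set of counts exists for each word separately; it does not build the tree or decide which particular branches keep their labels.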

2 General Remarks and History 

2.1 Some Observations on the Structure of 
Natural Languages 

It is well known that natural languages possess many 
other special properties that are not tested by the Zipf- 
law behavior. In particular, while finite-state grammars 
obey Zipf's law, it has long been known that they do 
not capture most of the striking properties of natural 
languages: 

1. Finite-state grammars by algebraic definition cannot
express hierarchical relationships, the acknowledged
hallmark of natural languages. Recall that
finite-state grammars are algebraically associative
concatenative systems (see, e.g., Harrison, 1978);
that is, if $\mathcal{L}$ is a finite-state grammar, then $\forall a, b, c \in \Sigma^*$,
$a \cdot bc \in \mathcal{L}$ iff $ab \cdot c \in \mathcal{L}$, where $\cdot$ is the concatenation
operator. Such a system cannot even express
the fact that one and the same linear string
of words, such as "the deep blue sky" can have
at least two structural (hierarchical) bracketings:
(the (deep blue) sky) and (the deep (blue sky)). In
other words, finite-state grammars can express only
linear precedence relations, not hierarchical relations.
(This demonstrates a failure of what Chomsky,
1956, called "strong generative capacity.")

2. Finite-state grammars, unlike natural language 
grammars, cannot generate arbitrarily deep center- 
embedded languages (see Chomsky 1956, 1986, and 
many other conventional sources). 



3. Under the currently best working assumptions, 
natural language grammars contain very specific 
constraint statements with proprietary theoreti- 
cal vocabularies unlikely to be duplicated in DNA 
"grammar," (e.g., one component, so-called "trace 
theory" is stated in terms of hierarchical structural
sentence properties and noun phrases, both
not shared by DNA, as far as is known). 3
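
The difference in item 1 between a linear string and its hierarchical bracketings can be made concrete. The sketch below uses nested tuples as a stand-in for hierarchical structure; this representation is an illustrative assumption, not a formalism drawn from the sources cited here.

```python
def flatten(tree):
    """Read the words off a nested-tuple bracketing from left to right."""
    if isinstance(tree, str):
        return [tree]
    return [word for subtree in tree for word in flatten(subtree)]

# Two hierarchical bracketings of the same word string, as in item 1.
reading_a = ("the", ("deep", "blue"), "sky")   # (the (deep blue) sky)
reading_b = ("the", "deep", ("blue", "sky"))   # (the deep (blue sky))

print(flatten(reading_a) == flatten(reading_b))  # True: same linear string
print(reading_a == reading_b)                    # False: different structures
```

Word-frequency counts are computed from the flattened string alone, so a rank-frequency statistic cannot register the difference between the two readings.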

2.2 Previous work on Zipf's Law and on DNA 
word frequencies 

Both Zipf's law and its application to DNA sequences 
have a long history. We mention only a few of the 
relevant points here. In the 1950s, as summarized in
Mandelbrot (1961), Mandelbrot, Simon (1955),
and Miller and Newman (1958), among others, explored
the nature of the word-frequency relationship embodied 
in Zipf's law. In particular, Mandelbrot showed how 
Markovian models of discourse (subsets of finite-state 
models) can give rise to Zipf-like behavior. Mandelbrot is 
careful to note the well-known inadequacy of such finite-state
models to describe linguistic rules. For example, he
writes (p. 191) "the 'finite-state' model appears as rather
shocking because of the well known existence of some
long-range influences in discourse, such as those studied
by grammar." He advocates ways out of this difficulty
while "acknowledging that the 'degree of validity' of the 
finite state model decreases as the 'wealth' of grammars 
increases." Mandelbrot also uses various information- 
theoretical arguments to suggest that Zipf's law is not 
peculiar to language, but extends to any coding scheme 
with a finite number of symbols — and therefore, can tell 
us relatively little about any coding scheme like DNA. 

As it turns out, there have also been many word-frequency
analyses of DNA sequences. As Pevzner et
al. (1989) point out, "Mathematical models of the generation
of genetic texts appeared simultaneously with
the first sequencing [of sic pn/rcb] DNA." Pevzner et
al. (1989) actually address the key question of variance
and significant differences explicitly, proposing formulae
for the variance of the number of word occurrences in texts,
making it possible to assess the significance of deviations
from expected statistical characteristics. One can therefore
carry out the significance tests suggested earlier in
this note.



3 Conclusions

We have argued that an observation of Zipf-like behavior
provides very little information about the nature of
the underlying process generating such frequency data.
This is simply because the underlying generative processes
could be as diverse as M-sided dice, simple finite-state
grammars, DNA sequences, and natural languages.
Inferring that noncoding DNA sequence grammars are
like natural language grammars solely on the basis of
Zipf-behavior is at best premature, and indeed at worst
is likely to be completely misleading and false.

3 We should point out that some researchers, e.g., Searls,
1993, maintain the contrary position and argue that natural
language and DNA grammars share at least some generative
processes. A discussion of this point is beyond the scope of
this note.



